In this short article, I wanted to get familiar with some libraries I might need to use in my thesis project. I also wanted to design an architecture and train a first autoencoder so that I get in touch with the basic concepts and build on it.
Like I mentionned, the objective of this article is to getting started. As such, I selected a synthetic dataset that has been widely used in different unsupervised experiments. This dataset, called chainlink, will allow us to benchmark some preliminary experiments. This 3 dimensionnal dataset is available from that link. Let’s first import our dataset:
import pandas
dt = pandas.read_csv('src/chainlink/chainlink.csv')
The animation below illustrates that dataset (along with the 2 different groups) is a 3D plan:
The main challenge with this task is that it is linearly not separable. As such, some traditionnal clustering methods with struggle identifying the 2 different groups. In the next sections, we will apply some traditionnal clustering methods and also have a first glimspe on how autoencoders could be involved in clustering tasks.
One well known and used clustering method is indeed the \(k\)-means. Given an initial set of \(k\) centroids, this method uses an iterative process where it assigns each observation to its closest centroid and then recompute those centroids.
Here is the